The Stable Signature: Rooting Watermarks in Latent Diffusion Models
Generative image modeling enables a wide range of applications but raises
ethical concerns about responsible deployment. This paper introduces an active
strategy combining image watermarking and Latent Diffusion Models. The goal is
for all generated images to conceal an invisible watermark allowing for future
detection and/or identification. The method quickly fine-tunes the latent
decoder of the image generator, conditioned on a binary signature. A
pre-trained watermark extractor recovers the hidden signature from any
generated image and a statistical test then determines whether it comes from
the generative model. We evaluate the invisibility and robustness of the
watermarks on a variety of generation tasks, showing that Stable Signature
works even after the images are modified. For instance, it detects the origin
of an image generated from a text prompt, then cropped to keep 10% of the
content, with 90+% accuracy at a false positive rate below 10⁻⁶.
Comment: Website at https://pierrefdz.github.io/publications/stablesignatur
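The detection step is essentially a bit-matching hypothesis test: count how many of the k extracted bits agree with the enrolled signature and reject the null hypothesis (non-watermarked image, i.i.d. random bits) if that agreement is too improbable. A minimal sketch in Python, assuming a binomial null model (the function and threshold here are illustrative, not the paper's exact implementation):

# Hedged sketch: flag an image as watermarked if the extracted k-bit message
# matches the enrolled signature better than chance would allow.
from scipy.stats import binom

def detect(extracted_bits, signature_bits, fpr_target=1e-6):
    k = len(signature_bits)
    matches = sum(int(a == b) for a, b in zip(extracted_bits, signature_bits))
    # p-value: probability that random bits match at least this well
    p_value = binom.sf(matches - 1, k, 0.5)
    return p_value < fpr_target, p_value

# Example: 45 of 48 bits match -> p-value ~ 7e-11, well below the target FPR
flagged, p = detect([1] * 45 + [0] * 3, [1] * 48)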
Superfilamentation in air
The interaction between a large number of laser filaments brought together
using weak external focusing leads to the emergence of few filamentary
structures reminiscent of standard filaments, but carrying a higher intensity.
The resulting plasma is measured to be one order of magnitude denser than for
short-scale filaments. This new propagation regime is dubbed
superfilamentation. Numerical simulations of a nonlinear envelope equation
provide good agreement with experiments.
Comment: 5 pages, 4 figures
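For context, filamentation simulations of this kind typically solve a nonlinear envelope equation coupling diffraction, Kerr self-focusing, plasma defocusing, and multiphoton absorption; a generic form (not necessarily the exact model solved in the paper) reads

\[ \frac{\partial \mathcal{E}}{\partial z} = \frac{i}{2k_0}\nabla_\perp^2\mathcal{E} + i k_0 n_2 |\mathcal{E}|^2\mathcal{E} - \frac{i k_0}{2\rho_c}\,\rho\,\mathcal{E} - \frac{\beta_K}{2}|\mathcal{E}|^{2K-2}\mathcal{E}, \qquad \frac{\partial \rho}{\partial t} = \sigma_K \rho_{\mathrm{at}} |\mathcal{E}|^{2K}, \]

where \(\rho\) is the electron density, \(\rho_c\) the critical plasma density, \(n_2\) the Kerr index, and \(\beta_K\), \(\sigma_K\) the K-photon absorption and ionization coefficients.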
Rewarded soups: towards Pareto-optimal alignment by interpolating weights fine-tuned on diverse rewards
Foundation models are first pre-trained on vast unsupervised datasets and
then fine-tuned on labeled data. Reinforcement learning, notably from human
feedback (RLHF), can further align the network with the intended usage. Yet the
imperfections in the proxy reward may hinder the training and lead to
suboptimal results; the diversity of objectives in real-world tasks and human
opinions exacerbate the issue. This paper proposes embracing the heterogeneity
of diverse rewards by following a multi-policy strategy. Rather than focusing
on a single a priori reward, we aim for Pareto-optimal generalization across
the entire space of preferences. To this end, we propose rewarded soup, first
specializing multiple networks independently (one for each proxy reward) and
then interpolating their weights linearly. This succeeds empirically because we
show that the weights remain linearly connected when fine-tuned on diverse
rewards from a shared pre-trained initialization. We demonstrate the
effectiveness of our approach for text-to-text (summarization, Q&A, helpful
assistant, review), text-image (image captioning, text-to-image generation,
visual grounding, VQA), and control (locomotion) tasks. We hope this approach
will enhance the alignment of deep models and the way they interact with the
world in all its diversity.
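The interpolation at the heart of rewarded soups is a single convex combination of parameters. A minimal sketch (hypothetical helper; assumes all networks were fine-tuned from the same pre-trained initialization and share one architecture):

# Hedged sketch: build a rewarded soup by linearly interpolating the weights
# of N reward-specialized networks (PyTorch-style state dicts).
def rewarded_soup(state_dicts, coeffs):
    assert abs(sum(coeffs) - 1.0) < 1e-6, "coefficients must sum to 1"
    return {
        name: sum(c * sd[name] for c, sd in zip(coeffs, state_dicts))
        for name in state_dicts[0]
    }

# Example: uniform soup of three policies fine-tuned on different rewards
# soup = rewarded_soup([sd_reward_a, sd_reward_b, sd_reward_c], [1/3, 1/3, 1/3])

Sweeping the coefficients over the simplex then traces an approximation of the Pareto front of preferences without any retraining.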
Semantic image editing from textual queries
The aim of this thesis is to propose algorithms for the task of Text-based Image Editing (TIE), which consists in editing digital images according to an instruction formulated in natural language. For instance, given an image of a dog and the query "Change the dog into a cat", we want to produce a novel image where the dog has been replaced by a cat, keeping all other image aspects unchanged (animal color and pose, background). The north-star goal is to enable anyone to edit their images using only queries in natural language. One specificity of text-based image editing is that there is practically no training data to train a supervised algorithm. In this thesis, we propose different solutions for editing images, based on the adaptation of large multimodal models trained on huge datasets.
We first study a simplified editing setup, named retrieval-based image editing, which does not require directly modifying the input image. Instead, given the image and the modification query, we search a large database for an image that corresponds to the requested edit. We leverage multimodal image/text alignment models trained on web-scale datasets (such as CLIP) to perform such transformations without any examples. We also propose the SIMAT framework for evaluating retrieval-based image editing.
We then study how to directly modify the input image. We propose FlexIT, a method which iteratively changes the input image until it satisfies an abstract "editing objective" defined in a multimodal embedding space, introducing a variety of regularization terms to enforce realistic transformations. Next, we focus on diffusion models, powerful generative models able to synthesize novel images conditioned on a wide variety of textual prompts. We demonstrate their versatility by proposing DiffEdit, an algorithm which adapts diffusion models to image editing without fine-tuning, together with a zero-shot strategy for automatically finding where the initial image should be changed to satisfy the text transformation query. Finally, we study a specific challenge useful in the context of image editing: synthesizing a novel image under the constraint of a spatial layout of objects with textual descriptions, a task known as Semantic Image Synthesis. We adopt the same strategy, adapting diffusion models to solve the task without any example, and propose the ZestGuide algorithm, which leverages the spatio-semantic information encoded in the attention layers of diffusion models.
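The retrieval-based editing setup described above can be sketched with off-the-shelf image/text embeddings: move the query image's embedding along the text direction from the source concept to the target concept, then retrieve the nearest database image. A short illustrative sketch (the names and the scalar lam are assumptions, not the thesis' exact formulation):

# Hedged sketch of retrieval-based image editing with CLIP-style embeddings.
import numpy as np

def retrieve_edit(img_emb, src_text_emb, tgt_text_emb, db_embs, lam=1.0):
    # e.g. src = "a dog", tgt = "a cat": shift the image embedding accordingly
    query = img_emb + lam * (tgt_text_emb - src_text_emb)
    query /= np.linalg.norm(query)
    scores = db_embs @ query           # cosine similarity (rows pre-normalized)
    return int(np.argmax(scores))      # index of the best-matching image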
Functional invariants to watermark large transformers
The rapid growth of transformer-based models increases concerns about their integrity and ownership. Watermarking addresses this issue by embedding a unique identifier into the model while preserving its performance. However, most existing approaches require optimizing the weights to imprint the watermark signal, which is not suitable at scale due to the computational cost. This paper explores watermarks with virtually no computational cost, applicable to a non-blind white-box setting (assuming access to both the original and watermarked networks). They generate functionally equivalent copies by leveraging the models' invariance, via operations such as dimension permutations or scaling/unscaling. This makes it possible to watermark models without any change in their outputs, and the watermark remains stealthy. Experiments demonstrate the effectiveness of the approach and its robustness against various model transformations (fine-tuning, quantization, pruning), making it a practical solution to protect the integrity of large models.
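The permutation invariance exploited here is easy to see on a two-layer MLP: reordering the hidden units and applying the matching reorder to the next layer's input weights leaves the function unchanged, so the chosen permutation can carry an identifier. A minimal PyTorch sketch (illustrative only; the paper's actual watermark embedding and extraction procedures are not shown):

# Hedged sketch: permute the hidden neurons of an MLP without changing outputs.
import torch
import torch.nn as nn

def permute_hidden(mlp, perm):
    fc1, fc2 = mlp[0], mlp[2]  # Linear -> ReLU -> Linear
    with torch.no_grad():
        fc1.weight.copy_(fc1.weight[perm])     # reorder hidden neurons
        fc1.bias.copy_(fc1.bias[perm])
        fc2.weight.copy_(fc2.weight[:, perm])  # reorder matching inputs
    return mlp

mlp = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4))
x = torch.randn(2, 8)
y0 = mlp(x)
permute_hidden(mlp, torch.randperm(16))
assert torch.allclose(y0, mlp(x), atol=1e-6)   # functionally equivalent copy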
Generation of long-lived underdense channels using femtosecond filamentation in air
Using femtosecond laser pulses at 800 and 400 nm, we characterize by means of transverse interferometry the formation of underdense channels in air generated by laser filamentation at the millijoule energy level. We find that under tight focusing conditions, filamentation generates a shock wave and that the resulting low-density channel lasts for more than 90 ms. Comparison of these results with simulations using an Eulerian hydrodynamic code shows good agreement and allows us to estimate the initial gas peak temperature at ∼1000 K. The influence of experimental parameters such as the focusing conditions of the ultrashort laser pulse, its polarization, and its wavelength is studied and linked to previous characterizations of filamentation-generated plasma columns.
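As a back-of-the-envelope consistency check (ours, not the paper's hydrodynamic simulation): once the heated column has relaxed back to ambient pressure, the ideal gas law at constant pressure gives

\[ \frac{\rho}{\rho_0} = \frac{T_0}{T} \approx \frac{300\,\mathrm{K}}{1000\,\mathrm{K}} \approx 0.3, \]

so an initial peak temperature of ∼1000 K is indeed compatible with a strongly underdense channel.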